{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "split-aluminum",
      "metadata": {
        "id": "split-aluminum",
        "papermill": {
          "duration": 0.048394,
          "end_time": "2021-04-15T21:06:39.560571",
          "exception": false,
          "start_time": "2021-04-15T21:06:39.512177",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/question-answering/question-answering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/question-answering/question-answering.ipynb)\n",
        "\n",
        "# Question Answering with Similarity Search"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "prospective-turner",
      "metadata": {
        "id": "prospective-turner",
        "papermill": {
          "duration": 0.045573,
          "end_time": "2021-04-15T21:06:39.651272",
          "exception": false,
          "start_time": "2021-04-15T21:06:39.605699",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "This notebook demonstrates how Pinecone's similarity search as a service helps you build a question answering application. We will index a set of questions and retrieve the most similar stored questions for a new (unseen) question. That way, we can link a new question to answers we might already have.\n",
        "\n",
        "You can build a questions answering application with Pinecone in three steps:\n",
        "- Represent questions as vector embeddings so that semantically similar questions are in close proximity within the same vector space. \n",
        "- Index vectors using Pinecone.\n",
        "- Given a new question, query the index to fetch similar questions. This can allow us to store answers associated with these questions \n",
        "\n",
        "In this notebook we will be dealing with indexing a set of quetions and retrieving similar questions for a new and unseen question.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "terminal-export",
      "metadata": {
        "id": "terminal-export",
        "papermill": {
          "duration": 0.044413,
          "end_time": "2021-04-15T21:06:39.741951",
          "exception": false,
          "start_time": "2021-04-15T21:06:39.697538",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "## Dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "expressed-executive",
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-04-15T21:06:39.845309Z",
          "iopub.status.busy": "2021-04-15T21:06:39.842494Z",
          "iopub.status.idle": "2021-04-15T21:08:22.163939Z",
          "shell.execute_reply": "2021-04-15T21:08:22.164616Z"
        },
        "id": "expressed-executive",
        "papermill": {
          "duration": 102.376674,
          "end_time": "2021-04-15T21:08:22.165052",
          "exception": false,
          "start_time": "2021-04-15T21:06:39.788378",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "!pip install -qU pinecone-client==3.1.0\n",
        "!pip install -qU matplotlib ipywidgets\n",
        "!pip install -qU sentence-transformers --no-cache-dir"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "another-vintage",
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-04-15T21:08:22.264702Z",
          "iopub.status.busy": "2021-04-15T21:08:22.263478Z",
          "iopub.status.idle": "2021-04-15T21:08:23.457180Z",
          "shell.execute_reply": "2021-04-15T21:08:23.457893Z"
        },
        "id": "another-vintage",
        "papermill": {
          "duration": 1.247965,
          "end_time": "2021-04-15T21:08:23.458146",
          "exception": false,
          "start_time": "2021-04-15T21:08:22.210181",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "\n",
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "earned-custom",
      "metadata": {
        "id": "earned-custom",
        "papermill": {
          "duration": 0.045818,
          "end_time": "2021-04-15T21:08:23.550719",
          "exception": false,
          "start_time": "2021-04-15T21:08:23.504901",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "## Pinecone Installation and Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "817f7671",
      "metadata": {},
      "source": [
        "Now we need a place to store these embeddings and enable a efficient vector search through them all. To do that we use Pinecone, we can get a [free API key](https://app.pinecone.io/) and enter it below where we will initialize our connection to Pinecone and create a new index."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "proprietary-animal",
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-04-15T21:08:23.648900Z",
          "iopub.status.busy": "2021-04-15T21:08:23.648053Z",
          "iopub.status.idle": "2021-04-15T21:08:24.325452Z",
          "shell.execute_reply": "2021-04-15T21:08:24.324627Z"
        },
        "id": "proprietary-animal",
        "papermill": {
          "duration": 0.730152,
          "end_time": "2021-04-15T21:08:24.325660",
          "exception": false,
          "start_time": "2021-04-15T21:08:23.595508",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from pinecone import Pinecone\n",
        "\n",
        "# initialize connection to pinecone (get API key at app.pinecone.io)\n",
        "api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'\n",
        "\n",
        "# configure client\n",
        "pc = Pinecone(api_key=api_key)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a06a821c",
      "metadata": {},
      "source": [
        "Now we setup our index specification, this allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "42e0a8bd",
      "metadata": {},
      "outputs": [],
      "source": [
        "from pinecone import ServerlessSpec\n",
        "\n",
        "cloud = os.environ.get('PINECONE_CLOUD') or 'aws'\n",
        "region = os.environ.get('PINECONE_REGION') or 'us-east-1'\n",
        "\n",
        "spec = ServerlessSpec(cloud=cloud, region=region)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5660ad6e",
      "metadata": {},
      "source": [
        "Create the index:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "oriental-ethics",
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-04-15T21:08:24.604860Z",
          "iopub.status.busy": "2021-04-15T21:08:24.604038Z",
          "iopub.status.idle": "2021-04-15T21:08:24.607475Z",
          "shell.execute_reply": "2021-04-15T21:08:24.608045Z"
        },
        "id": "oriental-ethics",
        "papermill": {
          "duration": 0.055961,
          "end_time": "2021-04-15T21:08:24.608283",
          "exception": false,
          "start_time": "2021-04-15T21:08:24.552322",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "index_name = \"question-answering\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "id": "extensive-basin",
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-04-15T21:08:24.710587Z",
          "iopub.status.busy": "2021-04-15T21:08:24.709749Z",
          "iopub.status.idle": "2021-04-15T21:08:37.484077Z",
          "shell.execute_reply": "2021-04-15T21:08:37.483412Z"
        },
        "id": "extensive-basin",
        "papermill": {
          "duration": 12.827272,
          "end_time": "2021-04-15T21:08:37.484264",
          "exception": false,
          "start_time": "2021-04-15T21:08:24.656992",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "# check if index already exists (it shouldn't if this is first time)\n",
        "if index_name not in pc.list_indexes().names():\n",
        "    # if does not exist, create index\n",
        "    pc.create_index(\n",
        "        index_name,\n",
        "        dimension=300,\n",
        "        metric='cosine',\n",
        "        spec=spec\n",
        "    )\n",
        "    # wait for index to be initialized\n",
        "    while not pc.describe_index(index_name).status['ready']:\n",
        "        time.sleep(1)\n",
        "\n",
        "# connect to index\n",
        "index = pc.Index(index_name)\n",
        "# view index stats\n",
        "index.describe_index_stats()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "explicit-priest",
      "metadata": {
        "id": "explicit-priest",
        "papermill": {
          "duration": 0.049676,
          "end_time": "2021-04-15T21:08:52.139707",
          "exception": false,
          "start_time": "2021-04-15T21:08:52.090031",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "## Uploading Questions"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "defined-magnet",
      "metadata": {
        "id": "defined-magnet",
        "papermill": {
          "duration": 0.046308,
          "end_time": "2021-04-15T21:08:52.233605",
          "exception": false,
          "start_time": "2021-04-15T21:08:52.187297",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "The dataset used in this notebook is the [Quora Question Pairs Dataset](https://www.kaggle.com/c/quora-question-pairs)."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cleared-discussion",
      "metadata": {
        "id": "cleared-discussion",
        "papermill": {
          "duration": 0.047282,
          "end_time": "2021-04-15T21:08:52.327202",
          "exception": false,
          "start_time": "2021-04-15T21:08:52.279920",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "Let's download the dataset and load the data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "id": "finite-bronze",
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-04-15T21:08:52.433878Z",
          "iopub.status.busy": "2021-04-15T21:08:52.431642Z",
          "iopub.status.idle": "2021-04-15T21:08:53.592592Z",
          "shell.execute_reply": "2021-04-15T21:08:53.593265Z"
        },
        "id": "finite-bronze",
        "papermill": {
          "duration": 1.219242,
          "end_time": "2021-04-15T21:08:53.593545",
          "exception": false,
          "start_time": "2021-04-15T21:08:52.374303",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "# download dataset from the url\n",
        "import requests\n",
        "\n",
        "DATA_DIR = \"tmp\"\n",
        "DATA_FILE = f\"{DATA_DIR}/quora_duplicate_questions.tsv\"\n",
        "DATA_URL = \"https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv\"\n",
        "\n",
        "\n",
        "def download_data():\n",
        "    os.makedirs(DATA_DIR, exist_ok=True)\n",
        "\n",
        "    if not os.path.exists(DATA_FILE):\n",
        "        r = requests.get(DATA_URL)  # create HTTP response object\n",
        "        with open(DATA_FILE, \"wb\") as f:\n",
        "            f.write(r.content)\n",
        "\n",
        "\n",
        "download_data()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "id": "based-champion",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2021-04-15T21:08:53.699133Z",
          "iopub.status.busy": "2021-04-15T21:08:53.698177Z",
          "iopub.status.idle": "2021-04-15T21:08:55.263678Z",
          "shell.execute_reply": "2021-04-15T21:08:55.262963Z"
        },
        "id": "based-champion",
        "outputId": "82d68574-7dc7-49c8-edf3-f787705e18e1",
        "papermill": {
          "duration": 1.623567,
          "end_time": "2021-04-15T21:08:55.263866",
          "exception": false,
          "start_time": "2021-04-15T21:08:53.640299",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "     qid1  \\\n",
            "0  216488   \n",
            "1  424959   \n",
            "2  300233   \n",
            "3  302677   \n",
            "4  468590   \n",
            "\n",
            "                                                                                                                                     question1  \n",
            "0                                                                                               I would love to give a TED talk. What do I do?  \n",
            "1                                                                Do all caps titles on YouTube videos attract more viewers than normal titles?  \n",
            "2                                                                                                How do I start self-learning ethical hacking?  \n",
            "3                                                                           Should learning musical instruments in schools be made compulsory?  \n",
            "4  Does the success of a self proclaimed Acharya Pankaj Pathak in Assam prove that we, as a state, are regressing back instead of progressing?  \n"
          ]
        }
      ],
      "source": [
        "pd.set_option(\"display.max_colwidth\", 500)\n",
        "\n",
        "df = pd.read_csv(\n",
        "    f\"{DATA_FILE}\", sep=\"\\t\", usecols=[\"qid1\", \"question1\"], index_col=False\n",
        ")\n",
        "df = df.sample(frac=1).reset_index(drop=True)\n",
        "df.drop_duplicates(inplace=True)\n",
        "df['qid1'] = df['qid1'].apply(str)\n",
        "df['question1'] = df['question1'].apply(str)\n",
        "print(df.head())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "quality-slovak",
      "metadata": {
        "id": "quality-slovak",
        "papermill": {
          "duration": 0.048216,
          "end_time": "2021-04-15T21:08:55.359112",
          "exception": false,
          "start_time": "2021-04-15T21:08:55.310896",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "### Define the model"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "respected-kennedy",
      "metadata": {
        "id": "respected-kennedy",
        "papermill": {
          "duration": 0.046535,
          "end_time": "2021-04-15T21:08:55.452389",
          "exception": false,
          "start_time": "2021-04-15T21:08:55.405854",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "We will use the [Averarage Word Embeddings Model](https://nlp.stanford.edu/projects/glove/) for this example. This model has a high computation speed but relatively low quality of embeddings. You can look into other sentence embeddings models such as the [Sentence Embeddings Models trained on Paraphrases](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1) for improving quality of embeddings. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "id": "incredible-attribute",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 273,
          "referenced_widgets": [
            "7e2c6790c8ed4ee7aceb83c2e6a522ca",
            "dbbd52c2f1124309bd666545edbdc2a0",
            "276ec728b0ae4e30a5cfb19c605ec1c7",
            "a592d29e95e4405185e8c2cd90b14d8b",
            "be07cd42ec8146598da9de037657fed7",
            "3be97998d9bf446483a6f613ff90d98b",
            "0acee025d65f43b4a77a7cd9cc3a471c",
            "91056eb4c0c742a5a963092f11dca4bc",
            "778a6a2a66734df9a2c328f1b9fa582b",
            "3b319fed1f7d4b77bcca43f112a67df6",
            "ad3bfb0794f34672a60d127af45b45ed",
            "52ea4e6ddef34309bbd4698a003abd60",
            "1a826f2a99d74f348f9dd8de10420435",
            "ce1e662b6b0b429bab7306f142f06eb9",
            "e52f0cb98ae14aa884af2d37b75263fb",
            "1264d5dc22d54a8dbe4e76b211ca8b7f",
            "f10831650c5242079cb397114b4e8067",
            "1b0ad017e4f84d2487ba79165fb12baa",
            "7a7d860b3e76461fa55b5907210b167c",
            "6a6d90fca151440aa739068c714cbd1d",
            "c1972458791543719a69a704b5bd773a",
            "23e7bece21374ea383c0b6444785feff",
            "55b36423592d4a3796fe6564631f2b6b",
            "e74bd716c5c34e2b8082ada02518bdaf",
            "c3a82771809c42e7ad99347bd9851f07",
            "d7bff0d9ebad46dbb8077746760930c9",
            "001cca66573242a5a7976603d1977b55",
            "e02e7c3e80f346ef826e9f7338b108d0",
            "9810878b1fc542f7ab23d762183ad86b",
            "924e7bbe27d5470aa530299470a97bb2",
            "de4dd1bfc9094e0480adc3fae7763abc",
            "24ad296b635d447499397031445036b1",
            "60d9fa663ec24cca8059ccdb5e7388ef",
            "c69443620225478fb7a1cbdaea622eb6",
            "75d6d93343aa4850af0a2f23be11389f",
            "58e03d29c0724b40b2f38c0eab89f859",
            "a0dccffad3384b15a443f46951cbbd94",
            "1956a767c27f4dc385044bd31ef8da01",
            "64e82add23d84518a43c2533b6844026",
            "c7ed8a5e2d9140c7bc2d39657cc01520",
            "e53b76a4b1704e04b3054a86a5dd3275",
            "1edd8fe4d6384c5ea55eba7864bbd7f8",
            "ed24d08e4f69470fbb301ae9017d26a2",
            "ae4df5b31f8d4ab389c1cfff1dd9be73",
            "8d957bdc122543d4bf33625eb7ac1cb5",
            "3e92ae1c3b1b4226b568ecdcf5f5288a",
            "02ef91f789824c0fbd3897fdd9c72683",
            "eec847f6a35d41a0a218e19d2fdb97a4",
            "8e301f0162bd48fca85eee5ba3cac2c7",
            "83ba0a0c80f44d0e8fecb52760c47d15",
            "543ede30b55346fd9d00091ba336462b",
            "dcc121dbacc14e5c8106a0d20309bfe0",
            "5313a3a23faa4b7fb96fa897cc601938",
            "b080e33429474b58bb2a55a31ca7c44f",
            "b9da7486c0a64de8b745786eee885bd0",
            "2e69c5a3192545e6804ba92e560cd5a0",
            "d9203ea31f304c3ca2d926495317b08f",
            "460765ba3be2455fabaace51c9ae32d7",
            "5dfc44918ced4534a0139ef1ca029d98",
            "5c7b293ca8134955b261b454c19a5930",
            "00e82b59de1b4aacabae6275aecdc4eb",
            "8af5dad815d144ee92b763b483d0f2e4",
            "590d0cb6f38a482490edface43250b99",
            "114fba9d1ecc42b7987b4d154352f0b7",
            "e20d5861468e46e4ad0c09380a8d2cae",
            "19c4a1a3044d43e89c5f89e1dcda3d1c",
            "eebca5f0920b4089966cf591f4f941d7",
            "d698709b42b0427ba7fda8e6a2362f08",
            "2ef083030bcc473badf56f252da1728d",
            "b1df798939d141c892f74cf37acdd6e2",
            "975de27e3edd463ab847c12728cc66f1",
            "5d8226a0a1224b34b2958a198bef8146",
            "efe1e226ee8146b9af66b5c75e0a12fd",
            "a3da36543151465b9af3583b8a263962",
            "272f7bdbb50949c8b0b6e12e002cb423",
            "121752ef77f0492899215190ab4ed43e",
            "0fd9fba8480340dab2d4ed07f9676e58",
            "0d67ac84cd6847329406147078807401",
            "77161a5160ca484e9a8626b9f56b6058",
            "3c601677c85b47369eccabe2f464fe82",
            "05d28cb1ebc449d1a92b83d6fdd77b52",
            "8943e03721d64a8ea1f81e281363b91f",
            "d98a950080f640789dec109d7ebef6ae",
            "56ce8f12e3ba4cc098ad6af938bdedc7",
            "bf0fd0340efe4d6a848b64b3dd7f80da",
            "d892b28e25704926bbe8eea1f67c84ea",
            "403b9fbb2c194915a51b7668dc749ac7",
            "fb4aa3931ed94b8087b2c896ee9dd1ce"
          ]
        },
        "execution": {
          "iopub.execute_input": "2021-04-15T21:08:55.555456Z",
          "iopub.status.busy": "2021-04-15T21:08:55.554079Z",
          "iopub.status.idle": "2021-04-15T21:09:24.460292Z",
          "shell.execute_reply": "2021-04-15T21:09:24.459479Z"
        },
        "id": "incredible-attribute",
        "outputId": "66c7a230-2fa9-4217-9cbd-fcd5b161d3c4",
        "papermill": {
          "duration": 28.961377,
          "end_time": "2021-04-15T21:09:24.460508",
          "exception": false,
          "start_time": "2021-04-15T21:08:55.499131",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "7e2c6790c8ed4ee7aceb83c2e6a522ca",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading:   0%|          | 0.00/690 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "52ea4e6ddef34309bbd4698a003abd60",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading:   0%|          | 0.00/480M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "55b36423592d4a3796fe6564631f2b6b",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading:   0%|          | 0.00/4.61M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "c69443620225478fb7a1cbdaea622eb6",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading:   0%|          | 0.00/164 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "8d957bdc122543d4bf33625eb7ac1cb5",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "2e69c5a3192545e6804ba92e560cd5a0",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading:   0%|          | 0.00/2.15k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "eebca5f0920b4089966cf591f4f941d7",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading:   0%|          | 0.00/122 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "0d67ac84cd6847329406147078807401",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading:   0%|          | 0.00/248 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "import torch\n",
        "from sentence_transformers import SentenceTransformer\n",
        "\n",
        "# set device to GPU if available\n",
        "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
        "# load the model from huggingface model hub\n",
        "model = SentenceTransformer(\"average_word_embeddings_glove.6B.300d\", device=device)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "leading-savings",
      "metadata": {
        "id": "leading-savings",
        "papermill": {
          "duration": 0.048413,
          "end_time": "2021-04-15T21:09:24.556921",
          "exception": false,
          "start_time": "2021-04-15T21:09:24.508508",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "### Creating Vector Embeddings"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "id": "serial-custody",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "f83aba19d401413f966a61ed23086d9b",
            "ebcc1823aeda4fb0ae67dc566012fd7b",
            "48a500162d854cba904ccc8424ed1da3",
            "e6ac253c1c5c468ca423dc1ec3758b02",
            "1b96c29230434e36974590e2abe5e1b4",
            "b93cea36847f4f22aa83318f07b673ae",
            "936d521e08574f5d82f9b530b4d94d56",
            "c8f29c9b12424d94b9164e180b24b2c6",
            "288830db9014466d8efffe4587348310",
            "5525aee8dff643e68da0ef945fd7a87e",
            "a52a7903354a4d61886499c7043f752d"
          ]
        },
        "execution": {
          "iopub.execute_input": "2021-04-15T21:09:24.673539Z",
          "iopub.status.busy": "2021-04-15T21:09:24.660954Z",
          "iopub.status.idle": "2021-04-15T21:11:36.568996Z",
          "shell.execute_reply": "2021-04-15T21:11:36.568047Z"
        },
        "id": "serial-custody",
        "outputId": "2ccdfb03-e3fd-48cb-958d-73541728cfa2",
        "papermill": {
          "duration": 131.964267,
          "end_time": "2021-04-15T21:11:36.569271",
          "exception": false,
          "start_time": "2021-04-15T21:09:24.605004",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "f83aba19d401413f966a61ed23086d9b",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Batches:   0%|          | 0/9083 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "# create embedding for each question\n",
        "question_vectors = model.encode(list(df.question1), show_progress_bar=True).tolist()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "id": "w_amwu50iCtj",
      "metadata": {
        "id": "w_amwu50iCtj"
      },
      "outputs": [],
      "source": [
        "# add question embeddings to dataframe\n",
        "df[\"question_vector\"] = question_vectors"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "confused-syndrome",
      "metadata": {
        "id": "confused-syndrome",
        "papermill": {
          "duration": 0.048476,
          "end_time": "2021-04-15T21:11:36.665995",
          "exception": false,
          "start_time": "2021-04-15T21:11:36.617519",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "### Index the Vectors"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "id": "c1ca6e02",
      "metadata": {
        "id": "c1ca6e02"
      },
      "outputs": [],
      "source": [
        "import itertools\n",
        "\n",
        "def chunks(iterable, batch_size=100):\n",
        "    it = iter(iterable)\n",
        "    chunk = tuple(itertools.islice(it, batch_size))\n",
        "    while chunk:\n",
        "        yield chunk\n",
        "        chunk = tuple(itertools.islice(it, batch_size))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "id": "altered-touch",
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-04-15T21:11:36.799785Z",
          "iopub.status.busy": "2021-04-15T21:11:36.798412Z",
          "iopub.status.idle": "2021-04-15T21:11:56.201883Z",
          "shell.execute_reply": "2021-04-15T21:11:56.201192Z"
        },
        "id": "altered-touch",
        "papermill": {
          "duration": 19.488365,
          "end_time": "2021-04-15T21:11:56.202071",
          "exception": false,
          "start_time": "2021-04-15T21:11:36.713706",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "for batch in chunks(zip(df.qid1, df.question_vector)):\n",
        "    index.upsert(vectors=batch)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "large-proportion",
      "metadata": {
        "id": "large-proportion",
        "papermill": {
          "duration": 0.049182,
          "end_time": "2021-04-15T21:11:56.302631",
          "exception": false,
          "start_time": "2021-04-15T21:11:56.253449",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "## Search"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "arctic-foster",
      "metadata": {
        "id": "arctic-foster",
        "papermill": {
          "duration": 0.048842,
          "end_time": "2021-04-15T21:11:56.400092",
          "exception": false,
          "start_time": "2021-04-15T21:11:56.351250",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "Once you have indexed the vectors it is very straightforward to query the index. These are the steps you need to follow:\n",
        "- Select a set of questions you want to query with\n",
        "- Use the Average Embedding Model to transform questions into embeddings.\n",
        "- Send each question vector to the Pinecone index and retrieve most similar indexed questions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "id": "friendly-circular",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2021-04-15T21:11:56.515019Z",
          "iopub.status.busy": "2021-04-15T21:11:56.509251Z",
          "iopub.status.idle": "2021-04-15T21:11:56.733221Z",
          "shell.execute_reply": "2021-04-15T21:11:56.732512Z"
        },
        "id": "friendly-circular",
        "outputId": "0c908149-0202-4fc8-f33e-463d2a355017",
        "papermill": {
          "duration": 0.284271,
          "end_time": "2021-04-15T21:11:56.733406",
          "exception": false,
          "start_time": "2021-04-15T21:11:56.449135",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "\n",
            "\n",
            " Original question : What is best way to make money online?\n",
            "\n",
            " Most similar questions based on pinecone vector search: \n",
            "\n",
            "       id                                             question     score\n",
            "0      57               What is best way to make money online?  1.000000\n",
            "1  297469           What is the best way to make money online?  1.000000\n",
            "2   55585        What is the best way for making money online?  0.989930\n",
            "3   28280         What are the best ways to make money online?  0.981526\n",
            "4  157045  What is the best way to make money on the internet?  0.978538\n",
            "\n",
            "\n",
            "\n",
            " Original question : How can i build an e-commerce website?\n",
            "\n",
            " Most similar questions based on pinecone vector search: \n",
            "\n",
            "       id                                                   question     score\n",
            "0  119383                   How can I develop an e-commerce website?  0.925466\n",
            "1    1713                 How would I develop an e-commerce website?  0.925466\n",
            "2    1714                     How do I create an e-commerce website?  0.919407\n",
            "3   79063             How do I build and host an e-commerce website?  0.918379\n",
            "4  245780  What is the best platform to build an e-commerce website?  0.894444\n"
          ]
        }
      ],
      "source": [
        "# define questions to query the vector index\n",
        "query_questions = [\n",
        "    \"What is best way to make money online?\",\n",
        "    \"How can i build an e-commerce website?\"\n",
        "]\n",
        "\n",
        "# extract embeddings for the questions\n",
        "query_vectors = model.encode(query_questions).tolist()\n",
        "\n",
        "# query pinecone\n",
        "query_results = [index.query(vector=xq, top_k=5) for xq in query_vectors]\n",
        "\n",
        "# show the results\n",
        "for question, res in zip(query_questions, query_results):\n",
        "    print(\"\\n\\n\\n Original question : \" + str(question))\n",
        "    print(\"\\n Most similar questions based on pinecone vector search: \\n\")\n",
        "\n",
        "    ids = [match.id for match in res.matches]\n",
        "    scores = [match.score for match in res.matches]\n",
        "    df_result = pd.DataFrame(\n",
        "        {\n",
        "            \"id\": ids,\n",
        "            \"question\": [\n",
        "                df[df.qid1 == _id].question1.values[0] for _id in ids\n",
        "            ],\n",
        "            \"score\": scores,\n",
        "        }\n",
        "    )\n",
        "    print(df_result)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "essential-health",
      "metadata": {
        "id": "essential-health",
        "papermill": {
          "duration": 0.050565,
          "end_time": "2021-04-15T21:11:56.835393",
          "exception": false,
          "start_time": "2021-04-15T21:11:56.784828",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "## Delete the Index"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "opposite-visitor",
      "metadata": {
        "id": "opposite-visitor",
        "papermill": {
          "duration": 0.051167,
          "end_time": "2021-04-15T21:11:56.937286",
          "exception": false,
          "start_time": "2021-04-15T21:11:56.886119",
          "status": "completed"
        },
        "tags": []
      },
      "source": [
        "Delete the index once you are sure that you do not want to use it anymore. Once it is deleted, you cannot reuse it.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "id": "rocky-selling",
      "metadata": {
        "execution": {
          "iopub.execute_input": "2021-04-15T21:11:57.047606Z",
          "iopub.status.busy": "2021-04-15T21:11:57.046545Z",
          "iopub.status.idle": "2021-04-15T21:12:09.690178Z",
          "shell.execute_reply": "2021-04-15T21:12:09.689514Z"
        },
        "id": "rocky-selling",
        "papermill": {
          "duration": 12.701629,
          "end_time": "2021-04-15T21:12:09.690372",
          "exception": false,
          "start_time": "2021-04-15T21:11:56.988743",
          "status": "completed"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "pc.delete_index(index_name)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "question_answering.ipynb",
      "provenance": []
    },
    "environment": {
      "name": "tf2-gpu.2-3.m65",
      "type": "gcloud",
      "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-3:m65"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.7 (main, Sep 14 2022, 22:38:23) [Clang 14.0.0 (clang-1400.0.29.102)]"
    },
    "papermill": {
      "default_parameters": {},
      "duration": 333.240754,
      "end_time": "2021-04-15T21:12:11.363566",
      "environment_variables": {},
      "exception": null,
      "input_path": "/notebooks/question_answering/question_answering.ipynb",
      "output_path": "/notebooks/tmp/question_answering/question_answering.ipynb",
      "parameters": {},
      "start_time": "2021-04-15T21:06:38.122812",
      "version": "2.3.3"
    },
    "vscode": {
      "interpreter": {
        "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}